{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d8dca77-3634-44fc-b253-56cadeaf69c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01803cf6-a631-4dcc-85e1-8d10235780d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from getpass import getpass\n",
    "\n",
    "from elasticsearch import Elasticsearch, helpers\n",
    "from sentence_transformers import SentenceTransformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b16308f-b9f3-4ce1-9250-3386bf8342f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "cloud_endpoint = getpass(\"Elastic deployment Cloud Endpoint: \")\n",
    "cloud_api_key = getpass(\"Elastic deployment API Key: \")\n",
    "INDEX_NAME = \"books-pipeline\"\n",
    "\n",
    "es = Elasticsearch(\n",
    "    hosts=cloud_endpoint,\n",
    "    api_key=cloud_api_key,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83c32fd8-e199-45e2-99d3-d835b9cf4c23",
   "metadata": {},
   "source": [
    "### 1. Create an Ingestion Pipeline\n",
    "We will create an inference ingest pipeline so that Elasticsearch generates an embedding of `book_description` whenever a document is indexed. Because the embedding model runs inside Elasticsearch, the client machine does not need to compute any vectors itself.\n",
    "\n",
    "Note that if a processor fails, the document is redirected to an index named `failed-<index>` (here, `failed-books-pipeline`), and the error message is stored in its `ingest.failure` field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c09c0b98-775d-47fd-bfcf-4e84a352e092",
   "metadata": {},
   "outputs": [],
   "source": [
    "resp = es.ingest.put_pipeline(\n",
    "    id=\"text-embedding\",\n",
    "    description=\"converts book description text to a vector\",\n",
    "    processors=[\n",
    "        {\n",
    "            \"inference\": {\n",
    "                \"model_id\": \"sentence-transformers__msmarco-minilm-l-12-v3\",\n",
    "                \"input_output\": [\n",
    "                    {\n",
    "                        \"input_field\": \"book_description\",\n",
    "                        \"output_field\": \"description_embedding\",\n",
    "                    }\n",
    "                ],\n",
    "            }\n",
    "        }\n",
    "    ],\n",
    "    on_failure=[\n",
    "        {\n",
    "            \"set\": {\n",
    "                \"description\": \"Index document to 'failed-<index>'\",\n",
    "                \"field\": \"_index\",\n",
    "                \"value\": \"failed-{{{_index}}}\",\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"set\": {\n",
    "                \"description\": \"Set error message\",\n",
    "                \"field\": \"ingest.failure\",\n",
    "                \"value\": \"{{_ingest.on_failure_message}}\",\n",
    "            }\n",
    "        },\n",
    "    ],\n",
    ")\n",
    "\n",
    "print(resp)"
   ]
  },
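  {
   "cell_type": "markdown",
   "id": "b7e2a9c4-1f0d-4a6e-9c3b-2d5e8f1a0c11",
   "metadata": {},
   "source": [
    "Before indexing real data, it can help to check what the pipeline does to a single sample document. The cell below is a minimal sketch, not part of the original workflow: `simulate_embedding` is a hypothetical helper that calls the ingest simulate API and returns either the generated embedding or the failure message recorded by the `on_failure` handlers. It assumes the model referenced by the pipeline is already deployed on the cluster and that the 8.x Python client is in use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3d1f7a2-6b8e-4f20-8a19-7e4c5b2d9f33",
   "metadata": {},
   "outputs": [],
   "source": [
    "def simulate_embedding(es, text, pipeline_id=\"text-embedding\"):\n",
    "    \"\"\"Run one sample document through the pipeline without indexing it.\"\"\"\n",
    "    resp = es.ingest.simulate(\n",
    "        id=pipeline_id,\n",
    "        docs=[{\"_source\": {\"book_description\": text}}],\n",
    "    )\n",
    "    # Each simulated document is expected under docs[i][\"doc\"]\n",
    "    source = resp[\"docs\"][0].get(\"doc\", {}).get(\"_source\", {})\n",
    "    embedding = source.get(\"description_embedding\")\n",
    "    # The on_failure handlers write the error message to ingest.failure\n",
    "    failure = source.get(\"ingest\", {}).get(\"failure\")\n",
    "    return embedding, failure\n",
    "\n",
    "\n",
    "# Usage in this notebook (the sample text is made up):\n",
    "# embedding, failure = simulate_embedding(es, \"A detective story set in Paris\")\n",
    "# print(len(embedding) if embedding else failure)"
   ]
  },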
  {
   "cell_type": "markdown",
   "id": "f10b5e28-6f17-426e-81e9-9b81ee3dc9c7",
   "metadata": {},
   "source": [
    "### 2. Create an index\n",
    "Now let's create an index in Elasticsearch. The mapping below does not declare the `description_embedding` field: the ingest pipeline adds it to each document at index time, and Elasticsearch maps it dynamically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7e1ca60-05fa-481b-b701-ca6a4573e156",
   "metadata": {},
   "outputs": [],
   "source": [
    "mappings = {\n",
    "    \"mappings\": {\n",
    "        \"properties\": {\n",
    "            \"book_title\": {\"type\": \"text\"},\n",
    "            \"author_name\": {\"type\": \"text\"},\n",
    "            \"rating_score\": {\"type\": \"float\"},\n",
    "            \"rating_votes\": {\"type\": \"integer\"},\n",
    "            \"review_number\": {\"type\": \"integer\"},\n",
    "            \"book_description\": {\"type\": \"text\"},\n",
    "            \"genres\": {\"type\": \"keyword\"},\n",
    "            \"year_published\": {\"type\": \"integer\"},\n",
    "            \"url\": {\"type\": \"text\"},\n",
    "        }\n",
    "    }\n",
    "}\n",
    "# Delete any previous index; ignore_unavailable avoids an error on the first run\n",
    "es.indices.delete(index=INDEX_NAME, ignore_unavailable=True)\n",
    "es.indices.create(index=INDEX_NAME, body=mappings)\n",
    "print(f\"Index '{INDEX_NAME}' created.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d895af24-739c-4890-a7dc-bd27907df981",
   "metadata": {},
   "source": [
    "### 3. Bulk Indexing many documents\n",
    "Now that we have created an index in Elasticsearch, we can index our local book objects. The `helpers.bulk` call below sends the documents in batches, which is much faster than issuing a separate index request for each book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5dfc0eb6-3480-437d-a16f-d2486a2fd99c",
   "metadata": {},
   "outputs": [],
   "source": [
    "file_path = \"../data/books.json\"\n",
    "with open(file_path, \"r\") as file:\n",
    "    books = json.load(file)\n",
    "\n",
    "# create an array of index actions, with each element holding one document\n",
    "actions = [\n",
    "    {\"_index\": INDEX_NAME, \"_id\": book.get(\"id\", None), \"_source\": book}\n",
    "    for book in books\n",
    "]\n",
    "\n",
    "try:\n",
    "    helpers.bulk(es, actions, pipeline=\"text-embedding\", chunk_size=1000)\n",
    "    print(f\"Successfully added {len(actions)} books into the '{INDEX_NAME}' index.\")\n",
    "\n",
    "except helpers.BulkIndexError as e:\n",
    "    print(f\"Error occurred while ingesting books: {e}\")\n",
    "    print(e.errors)"
   ]
  },
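  {
   "cell_type": "markdown",
   "id": "e9a4b6d8-2c7f-4e51-b0a3-4f6d1c8e7a55",
   "metadata": {},
   "source": [
    "The list comprehension above builds every action in memory and passes `_id: None` for books that have no `id`. As a hedged alternative, the sketch below uses a generator (which `helpers.bulk` also accepts) and only sets `_id` when the book provides one, leaving Elasticsearch to auto-generate the rest. `generate_actions` is a hypothetical helper name, not part of the Elasticsearch client."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2c8d0e6-5a3b-4d72-9e64-8b1a3f7c2d77",
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_actions(books, index_name):\n",
    "    \"\"\"Yield one bulk index action per book, omitting _id when the book has none.\"\"\"\n",
    "    for book in books:\n",
    "        action = {\"_index\": index_name, \"_source\": book}\n",
    "        if book.get(\"id\") is not None:\n",
    "            action[\"_id\"] = book[\"id\"]\n",
    "        yield action\n",
    "\n",
    "\n",
    "# Usage in this notebook:\n",
    "# helpers.bulk(es, generate_actions(books, INDEX_NAME), pipeline=\"text-embedding\")"
   ]
  },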
  {
   "cell_type": "markdown",
   "id": "d10ae8dd-5dd9-4a91-a9b1-8dc17bdb86ba",
   "metadata": {},
   "source": [
    "### 4. Indexing one document\n",
    "Let's also index a single book, since this is the operation you would typically run whenever a new book is added to your vector database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5d0930c7-4f77-426d-9180-7b7f204c2773",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully indexed book: Shattered World - Result: created\n"
     ]
    }
   ],
   "source": [
    "file_path = \"../data/one_book.json\"\n",
    "with open(file_path, \"r\") as file:\n",
    "    book = json.load(file)\n",
    "\n",
    "try:\n",
    "    resp = es.index(\n",
    "        index=INDEX_NAME,\n",
    "        id=book.get(\"id\", None),\n",
    "        body=book,\n",
    "        pipeline=\"text-embedding\",\n",
    "    )\n",
    "\n",
    "    print(\n",
    "        f\"Successfully indexed book: {book.get('book_title', None)} - Result: {resp.get('result', None)}\"\n",
    "    )\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"Error occurred while indexing book: {e}\")"
   ]
  },
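  {
   "cell_type": "markdown",
   "id": "a5d7e1f3-8b2c-4c93-a7d6-1e9f4b0c6a99",
   "metadata": {},
   "source": [
    "Because the pipeline redirects failures to another index instead of raising an error, a successful response does not guarantee that an embedding was created. The sketch below, which assumes the 8.x Python client, looks for documents in the `failed-<index>` index and returns their recorded error messages. `fetch_failed_documents` is a hypothetical helper name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1b3c5e7-9f4a-4b06-8c28-3a7e5d9f1b22",
   "metadata": {},
   "outputs": [],
   "source": [
    "def fetch_failed_documents(es, index_name, size=5):\n",
    "    \"\"\"Return the id and error message of documents routed to failed-<index>.\"\"\"\n",
    "    resp = es.search(\n",
    "        index=f\"failed-{index_name}\",\n",
    "        size=size,\n",
    "        # Return no hits instead of raising if the failed index was never created\n",
    "        ignore_unavailable=True,\n",
    "    )\n",
    "    return [\n",
    "        {\n",
    "            \"id\": hit[\"_id\"],\n",
    "            \"failure\": hit[\"_source\"].get(\"ingest\", {}).get(\"failure\"),\n",
    "        }\n",
    "        for hit in resp[\"hits\"][\"hits\"]\n",
    "    ]\n",
    "\n",
    "\n",
    "# Usage in this notebook:\n",
    "# for failed in fetch_failed_documents(es, INDEX_NAME):\n",
    "#     print(failed)"
   ]
  },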
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "601c6f60-30a2-4f2d-8567-bb4444b02b3f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
